A Stepping Stone towards Visualization of Data

Hello, everyone! This is my first-ever blog post, and I am very excited about it. Writing and setting up a blog had been on my mind for a while, and after many failed attempts and some excusable procrastination, I finally managed to write one.

This post is about a topic that I believe makes a huge difference in separating a good analysis from a great one.

“Numbers have an important story to tell. They rely on you to give them a clear and convincing voice.” - Stephen Few

The post will try to cover these aspects:

- An attempt to understand the visualization process put forward by Ben Fry
- Visualizing airport locations around the world
- A visual study of flight routes in and around India

Stages To Visual Information

My interest in generative art initially drew me to Ben Fry, as I wanted to develop an understanding of how to present data in a more meaningful way. His book Visualizing Data lays out a seven-step process (acquire, parse, filter, mine, represent, refine, interact) for building a narrative from data.

Let’s cover each step one by one. We will soon realize that these are less a set of incremental steps than intertwined processes that we keep returning to.

Locating Airports around the World.

Acquire

Most data visualizations originate from a question. It is important to have one, as it strips away unnecessary constructs and keeps the work focused on a precise answer.

Where are the Airports around the world located?

Let’s acquire the data.

Before that, I’ll load the required packages.

library(XML)
library(ggplot2)
library(tidyr)
library(dplyr)
library(sp)
library(geosphere)
library('maps')
library('ggthemes')
library('plotly')

After a few minutes of Google searching, I found a file that has the coordinates of airports around the globe. Let’s load that.

# tibble() replaces the deprecated tbl_df(); each line of the file becomes one row in a column named "value"
A_loc <- tibble(value = readLines("https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat"))
head(A_loc)
## # A tibble: 6 x 1
##                                                                         value
##                                                                         <chr>
## 1 "1,\"Goroka Airport\",\"Goroka\",\"Papua New Guinea\",\"GKA\",\"AYGA\",-6.0
## 2 "2,\"Madang Airport\",\"Madang\",\"Papua New Guinea\",\"MAG\",\"AYMD\",-5.2
## 3 "3,\"Mount Hagen Kagamuga Airport\",\"Mount Hagen\",\"Papua New Guinea\",\"
## 4 "4,\"Nadzab Airport\",\"Nadzab\",\"Papua New Guinea\",\"LAE\",\"AYNZ\",-6.5
## 5 "5,\"Port Moresby Jacksons International Airport\",\"Port Moresby\",\"Papua
## 6 "6,\"Wewak International Airport\",\"Wewak\",\"Papua New Guinea\",\"WWK\",\

The data is quite messy, but we have passed the first stage: we have acquired the data. If we look carefully, we can already see the country and the coordinates of each airport.

Parse

The next step is to give the acquired data a structure and to place it in an order that makes sense to us. An easy way to test whether our data has structure is to look at the parsed dataset and see if one can mentally “plot” something out of it.

I separated the dataset into the columns described in the documentation.

# Strip the double quotes that wrap the text fields
New_A_loc <- as.data.frame(sapply(A_loc, function(x) gsub("\"", "", x)))
# Split each line on commas into the 14 documented columns
New_A_loc <- separate(data = New_A_loc, col = value,
                      into = c("Airport_id", "Name", "City", "Country", "IATA", "ICAO",
                               "Lat", "Long", "Alt", "Timezone", "DST", "TZ", "Type", "Source"),
                      sep = ",")

New_A_loc$Lat <- as.numeric(New_A_loc$Lat)
New_A_loc$Long <- as.numeric(New_A_loc$Long)
New_A_loc$Alt<-as.numeric(New_A_loc$Alt)

head(New_A_loc)
##   Airport_id                                        Name         City
## 1          1                              Goroka Airport       Goroka
## 2          2                              Madang Airport       Madang
## 3          3                Mount Hagen Kagamuga Airport  Mount Hagen
## 4          4                              Nadzab Airport       Nadzab
## 5          5 Port Moresby Jacksons International Airport Port Moresby
## 6          6                 Wewak International Airport        Wewak
##            Country IATA ICAO       Lat    Long  Alt Timezone DST
## 1 Papua New Guinea  GKA AYGA -6.081690 145.392 5282       10   U
## 2 Papua New Guinea  MAG AYMD -5.207080 145.789   20       10   U
## 3 Papua New Guinea  HGU AYMH -5.826790 144.296 5388       10   U
## 4 Papua New Guinea  LAE AYNZ -6.569803 146.726  239       10   U
## 5 Papua New Guinea  POM AYPY -9.443380 147.220  146       10   U
## 6 Papua New Guinea  WWK AYWK -3.583830 143.669   19       10   U
##                     TZ    Type      Source
## 1 Pacific/Port_Moresby airport OurAirports
## 2 Pacific/Port_Moresby airport OurAirports
## 3 Pacific/Port_Moresby airport OurAirports
## 4 Pacific/Port_Moresby airport OurAirports
## 5 Pacific/Port_Moresby airport OurAirports
## 6 Pacific/Port_Moresby airport OurAirports

Looks good! We have given the data a specific structure: each airport now has an ID, a geolocation, a name, a city, and so on.

As mentioned before, although there are seven steps, no rule says we must follow them in a particular order or use all of them. They provide a framework for working with numbers and giving them a different persona.

Now that we have a proper dataset, I will jump directly to representing it on a map.
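
One caveat with splitting every line on commas: a quoted airport name that itself contains a comma would be cut into too many fields. As a minimal sketch of an alternative, base R's read.csv() respects the quotes. The two sample lines below stand in for the real file, and the second row is made up purely to show a name with a comma in it.

```r
# Two sample rows in the airports.dat layout. The second row is hypothetical
# and exists only to show a quoted name that contains a comma.
sample_lines <- c(
  '1,"Goroka Airport","Goroka","Papua New Guinea","GKA","AYGA",-6.081690,145.392,5282,10,"U","Pacific/Port_Moresby","airport","OurAirports"',
  '2,"Example Field, North","Example City","Nowhere","XXX","XXXX",1.5,2.5,100,0,"U","Etc/UTC","airport","Hypothetical"'
)

col_names <- c("Airport_id", "Name", "City", "Country", "IATA", "ICAO",
               "Lat", "Long", "Alt", "Timezone", "DST", "TZ", "Type", "Source")

# read.csv() handles the quoting, so the comma inside the second name is kept,
# and numeric columns such as Lat and Long are converted automatically.
parsed <- read.csv(text = sample_lines, header = FALSE,
                   col.names = col_names, stringsAsFactors = FALSE)

print(parsed[, c("Airport_id", "Name", "Lat", "Long")])
```

With the real data, the same call could be pointed at the file URL instead of the text argument, which I have not done here.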

Represent

This is where we get our first visualization. We will draw a scatter plot of airport locations on a world map. Let’s work on that.

I prefer ggplot2 over the base plotting system. I find it more flexible and easier to work with.

We have the latitude and longitude of each airport. ggplot2 has a borders() function that draws a world map layer, on which we can plot the coordinates. We can further customize it to our liking too!

world <- ggplot() +
  borders("world", colour = "#3e3e40", fill = "#3e3e40") +
  theme(panel.background = element_rect(fill = "#252526", colour = "#252526"),
        panel.grid.minor = element_blank(), panel.grid.major = element_blank(),
        axis.title.x = element_blank(), axis.title.y = element_blank(),
        axis.text.x = element_blank(), axis.text.y = element_blank(),
        axis.ticks.x = element_blank(), axis.ticks.y = element_blank())
# text and ID are not ggplot2 aesthetics; they are carried along for the plotly tooltip later
map <- world +
  geom_point(aes(x = Long, y = Lat,
                 text = paste('City: ', City,
                              '<br /> Name : ', Name),
                 ID = Airport_id),
             data = New_A_loc, colour = "#ebe6ed", alpha = 1/4, size = 0.4) +
  labs("Airport")

map

Looks good! So many Airports!!

Filter

Sometimes it’s better to be precise with limited data than to be inaccurate with a massive dataset. Filtering removes what is not useful, or rather what doesn’t have much impact on the overall visualization. I will filter for airports that are located in India.
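
A minimal sketch of that filter is below. The small airports table is a stand-in I made up so the snippet runs on its own (the coordinates for Delhi and Mumbai are approximate); in the post, the same subset() call would be applied to New_A_loc.

```r
# Hypothetical stand-in for New_A_loc, with only the columns needed here.
# Coordinates for DEL and BOM are approximate.
airports <- data.frame(
  Name    = c("Goroka Airport",
              "Indira Gandhi International Airport",
              "Chhatrapati Shivaji International Airport"),
  Country = c("Papua New Guinea", "India", "India"),
  IATA    = c("GKA", "DEL", "BOM"),
  Lat     = c(-6.081690, 28.5665, 19.0887),
  Long    = c(145.392, 77.1031, 72.8679),
  stringsAsFactors = FALSE
)

# Keep only the rows whose Country column is exactly "India"
india_airports <- subset(airports, Country == "India")

print(india_airports)
```

The filtered table could then be passed as the data argument of geom_point() to plot only Indian airports.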

Interact

Moving on, the last step is usually tricky and often underrated, yet making a graphic interactive has a profound impact. It gives the user control over the visual, which adds both depth and breadth to a plot.

For that, we will use the Plotly package in R.

ggplotly(map, tooltip = c('text', 'ID'))

And finally, we have our first visualization. The aesthetics can be improved further, I guess. I am bad with color, I can clearly see that.

Tracing Flight Routes

I found Aaron Koblin’s Flight Patterns to be an amazing masterpiece, so simple yet so informative. So this is my short attempt to replicate his work using just the routes and not the entire flight schedule (I was not able to access that).

Mine and Filter

There is an important concept called the great circle: the shortest path between two points on a sphere, which is roughly the route airlines follow from one point to another. Mine is a step that brings in mathematics and statistics to uncover more detail about the data.

Here I will consider only two main airports to observe a route: Delhi and Mumbai.
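
To get a feel for great circles, here is a minimal sketch that computes the length of one with the haversine formula in base R. It is not what the post uses for plotting (that is gcIntermediate() from geosphere). The function name haversine_km is my own, and the sketch assumes a perfectly spherical Earth with a radius of 6371 km and approximate coordinates for the Delhi and Mumbai airports.

```r
# Great-circle distance between two points given in degrees.
# Assumes a spherical Earth; radius_km = 6371 is a common mean radius.
haversine_km <- function(lon1, lat1, lon2, lat2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Approximate coordinates (longitude, latitude) for Delhi and Mumbai airports
del_bom_km <- haversine_km(77.1031, 28.5665, 72.8679, 19.0887)
print(del_bom_km)
```

A quick sanity check of the function: a quarter of the equator, from longitude 0 to 90, should come out as a quarter of the sphere's circumference.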

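# %in% matches either code; == would recycle c("DEL","BOM") down the rows and can miss matches
dat_point <- subset(New_A_loc, IATA %in% c("DEL", "BOM"), select = c(Long, Lat))
gg4 <- map +
  coord_cartesian(ylim = c(0, 60), xlim = c(40, 120)) +
  geom_point(data = dat_point, aes(x = Long, y = Lat), color = "#ebe6ed", alpha = 1/5, size = 2)

# Intermediate points along the great circle between the two airports
l <- as.data.frame(gcIntermediate(c(dat_point$Long[1], dat_point$Lat[1]),
                                  c(dat_point$Long[2], dat_point$Lat[2]),
                                  n = 100, addStartEnd = TRUE, sp = FALSE))
gg5 <- gg4 + geom_line(data = l, aes(x = lon, y = lat), color = "white")
gg5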

So far we have covered six of the seven steps in the visualization of data.

But again, the process is not unidirectional.

Acquire

# tibble() replaces the deprecated tbl_df(); each line of the file becomes one row in a column named "value"
routes <- tibble(value = readLines("https://raw.githubusercontent.com/jpatokal/openflights/master/data/routes.dat"))
routes <- separate(data = routes, col = value,
                   into = c("Airline", "Airline_iD", "Source_airport", "Source_airport_id",
                            "Destination_airport", "Destination_airport_id",
                            "Codeshare", "Stops", "Equipment"),
                   sep = ",")

Refine

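The refine step covers improving the visual features to clarify the representation. Before that, the route data needs some cleanup: missing values are coded as \N, and I pull out the source and destination airport IDs.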

routes[ routes == "\\N" ] <- NA

Routes_source<-routes[,4]
names(Routes_source)<-"Airport_id"
Routes_destination<-routes[,6]
names(Routes_destination)<-"Airport_id"

Filter

Airport<-New_A_loc[,c(1,4,7,8)]

d1<-left_join(Routes_source,Airport,by="Airport_id")
d2<-left_join(Routes_destination,Airport,by="Airport_id")
Df<-cbind(d1,d2)
names(Df)<-c("Airport_id_in","Country_in","Lat_in","Long_in","Airport_id_out","Country_out","Lat_out","Long_out")
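# Keep only routes where both airports were matched; as_tibble() replaces the deprecated tbl_df()
Df <- as_tibble(Df[complete.cases(Df), ])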

In<-subset(Df, Country_in =="India")
Out<-subset(Df,Country_out =="India")

## Incoming Flights

Mine

  l<-gcIntermediate(cbind(In$Long_in,In$Lat_in),cbind(In$Long_out,In$Lat_out),n=100,addStartEnd = TRUE,sp=TRUE)
  
  d_l<-SpatialLinesDataFrame(l,
                            data.frame(A_id_in = In$Airport_id_in,
                                       A_id_out = In$Airport_id_out,
                                       stringsAsFactors = FALSE))
  d_l_df <- fortify(d_l)
  gg6<-world+geom_path(data=d_l_df,aes(long, lat , group = group),alpha=0.05,color="white")
  
  gg6_t<-map+geom_path(data=d_l_df,aes(long, lat , group = group),alpha=0.05,color="white")
# Outgoing Flights
  l<-gcIntermediate(cbind(Out$Long_in,Out$Lat_in),cbind(Out$Long_out,Out$Lat_out),n=100,addStartEnd = TRUE,sp=TRUE)
  
  d_l<-SpatialLinesDataFrame(l,
                             data.frame(A_id_in = Out$Airport_id_in,
                                              A_id_out = Out$Airport_id_out,
                                              stringsAsFactors = FALSE))
  
  d_l_df <- fortify(d_l)

Represent

 gg6<- gg6+geom_path(data=d_l_df,aes(long, lat , group = group),alpha=0.05,color="white")

gg6+coord_cartesian(ylim = c(0,60),xlim=c(40,120))

Not exactly what I had in mind, but it goes some way towards imitating what Aaron Koblin created.

I hope you liked my first attempt at writing a blog post and, most of all, my attempt to explain the basic framework for visualization. Let me know what you think.